2024 Ironman Challenge Day 29: Ingest Pipeline - Twitter Data

2024 iThome Ironman Challenge

Self-Challenge Group

Starting Over with Elasticsearch series, part 28

16th Ironman Challenge

kimcheng

2024-10-14 21:35:23

Today we'll use the ES ingest pipeline to process the Twitter data that was used earlier for autocomplete. First, a quick recap:

What the data looks like

{'name': 'MaggieBreathnac',
 'user_id': 487990281,
 'tweet': 'Making memories #nanny #anpost #isolation #happy #corona @ An Rinn '
          'https://t.co/byA9uSVxFc',
 'tweet_id': 1244969855058677766,
 'retweets': 0,
 'favorites': 0,
 'created': '31-Mar-2020',
 'followers': 522,
 'is_user_verified': False,
 'geo': {'type': 'Point', 'coordinates': [52.04681279, -7.56678938]},
 'coordinates': {'type': 'Point', 'coordinates': [-7.56678938, 52.04681279]},
 'location': 'dublin',
 'primary_location': {'type': 'Point',
                      'coordinates': [-7.56678938, 52.04681279]}}

What we want it to look like

{'name': 'MaggieBreathnac',
 'user_id': 487990281,
 'tweet': 'Making memories #nanny #anpost #isolation #happy #corona @ An Rinn '
          'https://t.co/byA9uSVxFc',
 'tweet_id': 1244969855058677766,
 'retweets': 0,
 'favorites': 0,
 'created': '31-Mar-2020',
 'followers': 522,
 'is_user_verified': False,
 'geo': [-7.56678938, 52.04681279],
 'location': 'dublin'}

What the Python code did

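import json
from datetime import datetime


def data_to_es(json_file: str, index_name: str):
    """Yield cleaned tweet documents, each tagged with its target index."""
    with open(json_file, 'r') as f:
        data = json.load(f)
    fields_to_rm = ['primary_location', 'coordinates']
    for d in data:
        # Drop the redundant location fields if they are present.
        for field in fields_to_rm:
            d.pop(field, None)
        if d.get('geo'):
            # list.reverse() works in place and returns None, so take a
            # reversed copy instead: [lat, lon] -> [lon, lat].
            d['geo'] = d['geo']['coordinates'][::-1]
        if d.get('created'):
            # Parse strings such as '31-Mar-2020' into a datetime.
            d['created'] = datetime.strptime(d['created'], "%d-%b-%Y")
        d['_index'] = index_name
        yield d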

1. Remove the primary_location and coordinates fields
2. Take the values in geo.coordinates, reverse them, and write them to the geo field
3. Parse the created field as a date
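
As a quick check of the three steps above, here is a minimal sketch that applies them to a trimmed copy of the sample document shown earlier. The helper name `transform` is hypothetical and exists only for this illustration; it is not part of the original script.

```python
from datetime import datetime


def transform(doc: dict) -> dict:
    """Apply the three clean-up steps to one tweet (hypothetical helper)."""
    out = dict(doc)  # shallow copy so the input dict keeps its keys
    # 1. Remove the redundant location fields.
    for field in ('primary_location', 'coordinates'):
        out.pop(field, None)
    # 2. Replace geo with its coordinates in reversed order.
    if out.get('geo'):
        out['geo'] = out['geo']['coordinates'][::-1]
    # 3. Parse the created string into a datetime.
    if out.get('created'):
        out['created'] = datetime.strptime(out['created'], '%d-%b-%Y')
    return out


# Trimmed version of the sample tweet from the recap above.
sample = {
    'name': 'MaggieBreathnac',
    'created': '31-Mar-2020',
    'geo': {'type': 'Point', 'coordinates': [52.04681279, -7.56678938]},
    'coordinates': {'type': 'Point', 'coordinates': [-7.56678938, 52.04681279]},
    'location': 'dublin',
    'primary_location': {'type': 'Point',
                         'coordinates': [-7.56678938, 52.04681279]},
}

result = transform(sample)
print(result['geo'])      # [-7.56678938, 52.04681279]
print(result['created'])  # 2020-03-31 00:00:00
```

The slice `[::-1]` builds a new list, so the nested list in the input document is left untouched.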

These transformations are all fairly simple, which makes them a good fit for an ES ingest pipeline. Here are the processors for the three steps:

1. Remove the primary_location and coordinates fields


  {
    "remove": {
      "field": [
        "primary_location",
        "coordinates"
      ]
    }
  }
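
One difference from the Python version is worth noting: the script only dropped a field when it was present, while the remove processor by default raises an error for a document that lacks the field. Below is a hedged sketch of a more tolerant variant that uses the processor's `ignore_missing` option. It is written as a Python dict so it can be dumped to JSON, and the variable name `remove_processor` is only for illustration.

```python
import json

# Same remove processor as above, plus ignore_missing so that documents
# without these fields pass through instead of failing the pipeline.
remove_processor = {
    "remove": {
        "field": ["primary_location", "coordinates"],
        "ignore_missing": True,
    }
}

print(json.dumps(remove_processor, indent=2))
```

Whether this is needed depends on the data; if every tweet carries both fields, the original processor is enough.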

2. Take the values in geo.coordinates, reverse them, and write them to the geo field

a. Extract the geo.coordinates field into the geo field


  {
    "json": {
      "field": "geo.coordinates",
      "target_field": "geo"
    }
  }

b. Reverse the order of the latitude and longitude in the geo field

  {
    "script": {
      "source": "ArrayList tmp = ctx[\"geo\"]; Collections.reverse(tmp); ctx[\"geo\"] = tmp;"
    }
  }

3. Parse the created field as a date


  {
    "date": {
      "field": "created",
      "formats": [
        "dd-MMM-yyyy"
      ],
      "target_field": "created"
    }
  }
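
The Java-style pattern `dd-MMM-yyyy` plays the same role as the `%d-%b-%Y` format in the Python script. Here is a small sketch to confirm what the sample value parses to on the Python side. It assumes an English locale, since `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime

# '%d-%b-%Y' is the strptime counterpart of the 'dd-MMM-yyyy' pattern.
parsed = datetime.strptime('31-Mar-2020', '%d-%b-%Y')
print(parsed.isoformat())  # 2020-03-31T00:00:00
```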

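That completes the data pipeline. The trickiest part is the processor that reverses the latitude/longitude ArrayList, which requires some understanding of Painless scripting. Even so, it is much simpler than setting up an environment with Python and maintaining a Python script just for this one pipeline.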
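
To try the processors together before indexing anything, one option is the simulate API (`POST _ingest/pipeline/_simulate`), which runs a pipeline against sample documents. The sketch below only builds the request body in Python and prints it as JSON; sending it is left to whichever client you use. The description text and the trimmed sample document are assumptions for illustration, and the script source keeps only the reversal statements.

```python
import json

# The four processors in the order they should run.
processors = [
    {"remove": {"field": ["primary_location", "coordinates"]}},
    {"json": {"field": "geo.coordinates", "target_field": "geo"}},
    {"script": {"source": 'ArrayList tmp = ctx["geo"]; '
                          'Collections.reverse(tmp); ctx["geo"] = tmp;'}},
    {"date": {"field": "created", "formats": ["dd-MMM-yyyy"],
              "target_field": "created"}},
]

# Hypothetical description; any text works here.
pipeline = {"description": "Clean up tweet documents",
            "processors": processors}

# Body for the simulate API: the pipeline plus sample docs to run it on.
simulate_body = {
    "pipeline": pipeline,
    "docs": [
        {
            "_source": {
                "name": "MaggieBreathnac",
                "created": "31-Mar-2020",
                "geo": {"type": "Point",
                        "coordinates": [52.04681279, -7.56678938]},
                "coordinates": {"type": "Point",
                                "coordinates": [-7.56678938, 52.04681279]},
                "primary_location": {"type": "Point",
                                     "coordinates": [-7.56678938, 52.04681279]},
                "location": "dublin",
            }
        }
    ],
}

print(json.dumps(simulate_body, indent=2))
```

Order matters here: the json processor has to run before the script, because the script expects geo to already be a list.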

This series is finally coming to a close. The next post will be an overall recap.
